Goto

Collaborating Authors

 distance measure


CROCS: A Two-Stage Clustering Framework for Behaviour-Centric Consumer Segmentation with Smart Meter Data

Yerbury, Luke W., Campello, Ricardo J. G. B., Livingston, G. C. Jr, Goldsworthy, Mark, O'Neil, Lachlan

arXiv.org Machine Learning

With grid operators confronting rising uncertainty from renewable integration and a broader push toward electrification, Demand-Side Management (DSM) -- particularly Demand Response (DR) -- has attracted significant attention as a cost-effective mechanism for balancing modern electricity systems. Unprecedented volumes of consumption data from a continuing global deployment of smart meters enable consumer segmentation based on real usage behaviours, promising to inform the design of more effective DSM and DR programs. However, existing clustering-based segmentation methods insufficiently reflect the behavioural diversity of consumers, often relying on rigid temporal alignment, and faltering in the presence of anomalies, missing data, or large-scale deployments. To address these challenges, we propose a novel two-stage clustering framework -- Clustered Representations Optimising Consumer Segmentation (CROCS). In the first stage, each consumer's daily load profiles are clustered independently to form a Representative Load Set (RLS), providing a compact summary of their typical diurnal consumption behaviours. In the second stage, consumers are clustered using the Weighted Sum of Minimum Distances (WSMD), a novel set-to-set measure that compares RLSs by accounting for both the prevalence and similarity of those behaviours. Finally, community detection on the WSMD-induced graph reveals higher-order prototypes that embody the shared diurnal behaviours defining consumer groups, enhancing the interpretability of the resulting clusters. Extensive experiments on both synthetic and real Australian smart meter datasets demonstrate that CROCS captures intra-consumer variability, uncovers both synchronous and asynchronous behavioural similarities, and remains robust to anomalies and missing data, while scaling efficiently through natural parallelisation. These results...


Faster Differentially Private Samplers via Rényi Divergence Analysis of Discretized Langevin MCMC

Neural Information Processing Systems

Various differentially private algorithms instantiate the exponential mechanism, and require sampling from the distribution $\exp(-f)$ for a suitable function $f$. When the domain of the distribution is high-dimensional, this sampling can be challenging. Using heuristic sampling schemes such as Gibbs sampling does not necessarily lead to provable privacy. When $f$ is convex, techniques from log-concave sampling lead to polynomial-time algorithms, albeit with large polynomials. Langevin dynamics-based algorithms offer much faster alternatives under some distance measures such as statistical distance. In this work, we establish rapid convergence for these algorithms under distance measures more suitable for differential privacy. For smooth, strongly-convex $f$, we give the first results proving convergence in R\'enyi divergence. This gives us fast differentially private algorithms for such $f$. Our techniques and simple and generic and apply also to underdamped Langevin dynamics.


SPROCKET: Extending ROCKET to Distance-Based Time-Series Transformations With Prototypes

Harner, Nicholas

arXiv.org Artificial Intelligence

Classical Time Series Classification algorithms are dominated by feature engineering strategies. One of the most prominent of these transforms is ROCKET, which achieves strong performance through random kernel features. We introduce SPROCKET (Selected Prototype Random Convolutional Kernel Transform), which implements a new feature engineering strategy based on prototypes. On a majority of the UCR and UEA Time Series Classification archives, SPROCKET achieves performance comparable to existing convolutional algorithms and the new MR-HY-SP ( MultiROCKET-HYDRA-SPROCKET) ensemble's average accuracy ranking exceeds HYDRA-MR, the previous best convolutional ensemble's performance. These experimental results demonstrate that prototype-based feature transformation can enhance both accuracy and robustness in time series classification.


A novel k-means clustering approach using two distance measures for Gaussian data

Gada, Naitik

arXiv.org Machine Learning

Clustering algorithms have long been the topic of research, representing the more popular side of unsupervised learning. Since clustering analysis is one of the best ways to find some clarity and structure within raw data, this paper explores a novel approach to \textit{k}-means clustering. Here we present a \textit{k}-means clustering algorithm that takes both the within cluster distance (WCD) and the inter cluster distance (ICD) as the distance metric to cluster the data into \emph{k} clusters pre-determined by the Calinski-Harabasz criterion in order to provide a more robust output for the clustering analysis. The idea with this approach is that by including both the measurement metrics, the convergence of the data into their clusters becomes solidified and more robust. We run the algorithm with some synthetically produced data and also some benchmark data sets obtained from the UCI repository. The results show that the convergence of the data into their respective clusters is more accurate by using both WCD and ICD measurement metrics. The algorithm is also better at clustering the outliers into their true clusters as opposed to the traditional \textit{k} means method. We also address some interesting possible research topics that reveal themselves as we answer the questions we initially set out to address.


Newton-Flow Particle Filters based on Generalized Cramér Distance

Hanebeck, Uwe D.

arXiv.org Artificial Intelligence

We propose a recursive particle filter for high-dimensional problems that inherently never degenerates. The state estimate is represented by deterministic low-discrepancy particle sets. We focus on the measurement update step, where a likelihood function is used for representing the measurement and its uncertainty. This likelihood is progressively introduced into the filtering procedure by homotopy continuation over an artificial time. A generalized Cramér distance between particle sets is derived in closed form that is differentiable and invariant to particle order. A Newton flow then continually minimizes this distance over artificial time and thus smoothly moves particles from prior to posterior density. The new filter is surprisingly simple to implement and very efficient. It just requires a prior particle set and a likelihood function, never estimates densities from samples, and can be used as a plugin replacement for classic approaches.


A Algorithms

Neural Information Processing Systems

Missing functions are defined in Appendix A.2. input: L = 400 (see Section B). As outlined above in Section 4.3, we adjust the hardness of the test until the test produces conclusive We opted for this, since it increases the hardness of the test.


Optimal Approximation - Smoothness Tradeoffs for Soft-Max Functions

Neural Information Processing Systems

A soft-max function has two main efficiency measures: (1) approximation - which corresponds to how well it approximates the maximum function, (2) smoothness - which shows how sensitive it is to changes of its input. Our goal is to identify the optimal approximation-smoothness tradeoffs for different measures of approximation and smoothness. This leads to novel soft-max functions, each of which is optimal for a different application. The most commonly used soft-max function, called exponential mechanism, has optimal tradeoff between approximation measured in terms of expected additive approximation and smoothness measured with respect to Rényi Divergence .


Rethinking Metrics and Diffusion Architecture for 3D Point Cloud Generation

Bastico, Matteo, Ryckelynck, David, Corté, Laurent, Tillier, Yannick, Decencière, Etienne

arXiv.org Artificial Intelligence

As 3D point clouds become a cornerstone of modern technology, the need for sophisticated generative models and reliable evaluation metrics has grown exponentially. In this work, we first expose that some commonly used metrics for evaluating generated point clouds, particularly those based on Chamfer Distance (CD), lack robustness against defects and fail to capture geometric fidelity and local shape consistency when used as quality indicators. W e further show that introducing samples alignment prior to distance calculation and replacing CD with Density-Aware Chamfer Distance (DCD) are simple yet essential steps to ensure the consistency and robustness of point cloud generative model evaluation metrics. While existing metrics primarily focus on directly comparing 3D Euclidean coordinates, we present a novel metric, named Surface Normal Concordance (SNC), which approximates surface similarity by comparing estimated point normals. This new metric, when combined with traditional ones, provides a more comprehensive evaluation of the quality of generated samples. Finally, leveraging recent advancements in transformer-based models for point cloud analysis, such as serialized patch attention, we propose a new architecture for generating high-fidelity 3D structures, the Diffusion Point Transformer . W e perform extensive experiments and comparisons on the ShapeNet dataset, showing that our model outperforms previous solutions, particularly in terms of quality of generated point clouds, achieving new state-of-the-art.


Effectiveness of High-Dimensional Distance Metrics on Solar Flare Time Series

Rohlfing, Elaina, Ahmadzadeh, Azim, Aparna, V

arXiv.org Artificial Intelligence

Solar-flare forecasting has been extensively researched yet remains an open problem. In this paper, we investigate the contributions of elastic distance measures for detecting patterns in the solar-flare dataset, SWAN-SF. We employ a simple $k$-medoids clustering algorithm to evaluate the effectiveness of advanced, high-dimensional distance metrics. Our results show that, despite thorough optimization, none of the elastic distances outperform Euclidean distance by a significant margin. We demonstrate that, although elastic measures have shown promise for univariate time series, when applied to the multivariate time series of SWAN-SF, characterized by the high stochasticity of solar activity, they effectively collapse to Euclidean distance. We conduct thousands of experiments and present both quantitative and qualitative evidence supporting this finding.


Beyond Point Matching: Evaluating Multiscale Dubuc Distance for Time Series Similarity

Ahmadzadeh, Azim, Khazaei, Mahsa, Rohlfing, Elaina

arXiv.org Artificial Intelligence

Abstract--Time series are high-dimensional and complex data objects, making their efficient search and indexing a longstanding challenge in data mining. Building on a recently introduced similarity measure, namely Multiscale Dubuc Distance (MDD), this paper investigates its comparative strengths and limitations relative to the widely used Dynamic Time Warping (DTW). MDD is novel in two key ways: it evaluates time series similarity across multiple temporal scales and avoids point-to-point alignment. We demonstrate that in many scenarios where MDD outperforms DTW, the gains are substantial, and we provide a detailed analysis of the specific performance gaps it addresses. We provide simulations, in addition to the 95 datasets from the UCR archive, to test our hypotheses. Finally, we apply both methods to a challenging real-world classification task and show that MDD yields a significant improvement over DTW, underscoring its practical utility. Time series, or more generally, ordered high-dimensional data types, have become increasingly prevalent with the rise of powerful computational tools and machine learning techniques. In this study, we adopt the term time series as an umbrella label for all such sequential data. A central challenge in analyzing time series lies in defining and measuring similarity. Similarity is inherently subjective, shaped by the specific goals and nuances of a given application. The existing literature has produced a rich landscape of similarity measures, each tailored to specific assumptions and use cases.